Plagiarism Detection

Project Overview

In this project, you will build a plagiarism detector that examines a text file and performs binary classification, labeling the file as either plagiarized or not based on how similar it is to a provided source text.

This project will be broken down into three main notebooks:

Notebook 1: Data Exploration

Load in the corpus of plagiarism text data.
Explore the existing data features and the data distribution.
This first notebook is not required in your final project submission.

Notebook 2: Feature Engineering

Clean and pre-process the text data.
Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
Select "good" features by analyzing the correlations between different features.
Create train/test .csv files that hold the relevant features and class labels for train/test data points.
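
As an illustration of what a similarity feature might look like, here is a minimal sketch of an n-gram containment score: the fraction of word n-grams in an answer text that also appear in the source text. The choice of containment, the function names ngrams and containment, and the simple lowercase-and-split tokenization are all assumptions made for this example; the project notebooks may define their features differently.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def containment(answer_text, source_text, n=1):
    """Fraction of the answer's n-grams that also occur in the source.

    Hypothetical helper for illustration only. Returns a value in [0, 1];
    higher values mean more of the answer overlaps with the source.
    """
    answer_counts = Counter(ngrams(answer_text, n))
    source_counts = Counter(ngrams(source_text, n))

    total = sum(answer_counts.values())
    if total == 0:
        # The answer is too short to contain any n-gram of this size.
        return 0.0

    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((answer_counts & source_counts).values())
    return shared / total


if __name__ == "__main__":
    source = "the cat sat on the mat"
    print(containment("the cat sat", source, n=1))
    print(containment("a dog sat on a rug", source, n=2))
```

Computing the score for several values of n gives several candidate features, which can then be compared by correlation before choosing which ones to write to the train/test files.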

Notebook 3: Train and Deploy Your Model in SageMaker

Upload your train/test feature data to S3.
Define a binary classification model and a training script.
Train your model and deploy it using SageMaker.
Evaluate your deployed classifier.
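
To make the evaluation step more concrete, the following is a rough sketch of how predictions from a binary classifier could be scored against true labels. It assumes labels are encoded as 1 (plagiarized) and 0 (not plagiarized), and the function name evaluate is hypothetical; it does not call SageMaker and only shows the arithmetic behind accuracy, precision, and recall.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (0 or 1).

    Hypothetical helper for illustration; assumes 1 means "plagiarized".
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    true_labels = [1, 0, 1, 1, 0]
    predicted_labels = [1, 0, 0, 1, 1]
    print(evaluate(true_labels, predicted_labels))
```

Looking at precision and recall alongside accuracy helps show whether the classifier's errors are mostly false alarms or mostly missed cases of plagiarism.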

Getting the Project Materials

You have been given the starting notebooks in a GitHub repository, linked below.

Since this project uses SageMaker, it is suggested that you create a new SageMaker notebook instance in your AWS console and link it to the GitHub repository: https://github.com/udacity/ML_SageMaker_Studies

The project files are in the Project_Plagiarism_Detection directory.

You should complete each exercise and question; your project will be evaluated against the project rubric.

Project Evaluation

You will be graded on your implementation of a plagiarism detector as well as on complete answers to any questions in the project notebooks. You'll submit a zip file or GitHub repository that includes the completed notebooks, with all cells executed, and you'll be graded according to the project rubric.

Exploring the Data

Before starting the project, you can optionally explore the plagiarism data you'll be working with in the next workspace.
